# Load the CSV file into a data frame
library(dplyr)  # for the %>% pipe used below
df = read.csv("Survey+Response.csv")
col.list = c("Matlab", "R", "Github", "Excel", "SQL", "RStudio", "ggplot2", "shell (terminal / command line)", "C/C++", "Python", "Stata", "LaTeX", "XML", "Web: html css js", "google drive (formerly docs)", "Sweave/knitr","dropbox", "SPSS", "regular expressions (grep)", "lattice" )
# Count the NAs in each column
na.check = df %>% is.na() %>% apply(2, sum)
# Keep only the columns without NAs
df_clean = df[, which(na.check == 0)]
# Create indicator columns, initialized to 0
df_clean[, col.list] = 0
for(i in col.list){
  # "R" needs an if statement so it does not also match "RStudio".
  if(i == "R"){
    # The pattern "R,|R$" matches "R," or an "R" with nothing after it
    # at the end of the string (row 87 caused this issue).
    fnd = "R,|R$"
    # Return the row numbers where the pattern is found
    rows = grep(pattern = fnd, x = df_clean$Experiences.with.tools)
  }else{
    # All other tool names can be matched literally
    rows = grep(pattern = i, x = df_clean$Experiences.with.tools, fixed = TRUE)
  }
  df_clean[rows, i] = 1
}
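The special-casing of "R" can be seen on a toy vector (the responses below are made up for illustration; they are not survey rows):

```r
# Illustrative only: plain "R" as a pattern also matches inside
# "RStudio", which is why the script uses "R,|R$" instead.
responses <- c("R, RStudio, Python", "RStudio, Excel", "Matlab, R")
grep("R", responses)       # 1 2 3 -- row 2 matches only because of "RStudio"
grep("R,|R$", responses)   # 1 3   -- "R," or an "R" at the end of the string
```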
Next we divide the programs into four groups and rename the overly long skill-proficiency column names.
Program <- rep(0, nrow(df_clean))
for(i in 1:nrow(df_clean)){
  if(df_clean$Program[i] %in% c("Data Science", "IDSE (master)", "Ms in ds", "MSDS")){
    Program[i] <- "MS_DS"
  } else if(df_clean$Program[i] == "Data Science Certification"){
    Program[i] <- "Certificate_DS"
  } else if(df_clean$Program[i] == "Statistics (master)"){
    Program[i] <- "MA_Stat"
  } else {
    Program[i] <- "Others"
  }
}
df_clean$Program <- as.factor(Program)
names(df_clean)[c(4:5, 7:11)] <- c("r_data_modeling_experience", "gender", "r_graphics_experience",
"r_advanced_multivariate_analysis_experience",
"r_markdown_experience",
"matlab_experience","github_experience")
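As a side note, the recoding loop above can also be written in vectorized form with `dplyr::case_when` (the document already loads dplyr for the pipe); shown here on a toy vector, not the survey data:

```r
library(dplyr)
# Toy program labels for illustration
prog <- c("MSDS", "Data Science Certification", "Statistics (master)", "History")
case_when(prog %in% c("Data Science", "IDSE (master)", "Ms in ds", "MSDS") ~ "MS_DS",
          prog == "Data Science Certification" ~ "Certificate_DS",
          prog == "Statistics (master)" ~ "MA_Stat",
          TRUE ~ "Others")
# "MS_DS" "Certificate_DS" "MA_Stat" "Others"
```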
Copy the dataset for further use and create two new variables: num_skills, the number of skills a respondent claims to have, and prof_skills, the sum of the self-reported experience levels across the programming tools, as a measure of overall proficiency (each level scaled from 0, "no experience", to 3, "expert").
mydata <- df_clean
# Map the self-reported experience labels to a 0-3 numeric scale
convert_prof <- function(x){
  if(x == "None"){
    0
  } else if(x == "A little"){
    1
  } else if(x == "Confident"){
    2
  } else if(x == "Expert"){
    3
  } else {
    NA  # guard against unexpected labels
  }
}
# Apply cell by cell to the experience columns
tmp <- apply(mydata[c(4, 7:11)], c(1, 2), convert_prof)
mydata[c(4, 7:11)] <- tmp
mydata$num_skills <- apply(mydata[, 12:31], 1, sum)
mydata$prof_skills <- apply(mydata[, c(4, 7:11)], 1, sum)
# Blank gender responses become their own category
mydata$gender[mydata$gender == ""] <- "doesn't matter"
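Note that `apply(X, c(1, 2), f)` applies `f` to every cell, so a frame of text labels becomes a numeric matrix of the same shape. A self-contained toy illustration (using `switch` in place of the if/else chain):

```r
# apply over MARGIN c(1, 2) visits each cell individually
lvl <- matrix(c("None", "Expert", "A little", "Confident"), nrow = 2)
num <- apply(lvl, c(1, 2), function(x)
  switch(x, "None" = 0, "A little" = 1, "Confident" = 2, "Expert" = 3))
num  # 2 x 2 numeric matrix with the same layout as lvl
```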
Here we see the mean of students’ reported skill levels. There were 114 students in total, and levels are on a scale from 0 (experience “none”) to 3 (experience “expert”). Students reported being most familiar with R data modeling and least familiar with Matlab:
| Matlab | GitHub | R Markdown | R Multivariate Analysis | R Graphics | R Data Modeling |
|---|---|---|---|---|---|
| 0.833 | 0.991 | 0.956 | 0.939 | 1.114 | 1.632 |
Here we see a breakdown of students’ R data modeling experience by gender. Most students reported an experience level of 2, i.e., they are “confident” in their R data modeling skills.
A chord diagram can intuitively illustrate the relationships between skills, i.e., the proportion of people who have one skill (e.g. SQL) and also have another (e.g. Python). It is also well suited to visualizing the relationship between skills and programs, and thus provides basic guidance for deeper analysis.
To visualize these relationships, we need to select the relevant features (in our case, the columns corresponding to the skillset questions and the program column) and split each skill into its own indicator column (e.g. 1 in the SQL column means familiarity with SQL, 0 means none). The cleaning scripts described in the previous sections do exactly this; here df_clean is further extracted and transformed into the desired data frame.
Another cleaning step selects the skills that matter: if a skill is reported by too few people, we filter it out.
# Filter features: select the skill indicator columns and shorten long names
df_mw <- df_clean[, c(12:31)]
colnames(df_mw)[8] = 'Shell'
colnames(df_mw)[14] = 'Web'
colnames(df_mw)[15] = 'Google Doc'
# Drop the skills reported by too few people
df_mw = df_mw[, -c(13, 16:20)]
majors = separate_major(df_clean)
IDSE = colMeans(df_mw[majors[[1]],])
DSC = colMeans(df_mw[majors[[2]],])
STATS = colMeans(df_mw[majors[[3]],])
Other = colMeans(df_mw[majors[[4]],])
df_mwcd = rbind(IDSE, DSC, STATS, Other)
df_mwbymajor = data.frame(from = rep(rownames(df_mwcd), times = ncol(df_mwcd)), to = rep(colnames(df_mwcd), each = nrow(df_mwcd)),
value = as.vector(df_mwcd),
stringsAsFactors = FALSE)
# chordDiagram() comes from the circlize package
library(circlize)
grid.col = NULL
grid.col[unique(df_mwbymajor$to)] = 'grey'
grid.col[unique(df_mwbymajor$from)] = c('red', 'blue', 'yellow', 'green')
chordDiagram(df_mwbymajor, grid.col = grid.col)
This produces the chord diagram from programs to skills. Each program and each skill has a corresponding arc on the circle, and each chord (the colored ribbons inside the circle) connects the proportion of students in a program to the corresponding skill.
We then further transform the dataset to create a new chord diagram showing the skill-to-skill relationships.
# Count, for each pair of skills, how many students report both
n_skills = ncol(df_mw)
df_mwskillset = data.frame(matrix(0, nrow = n_skills, ncol = n_skills))
colnames(df_mwskillset) = colnames(df_mw)
rownames(df_mwskillset) = colnames(df_mw)
sk_list = colnames(df_mw)
for(i in 1:nrow(df_mw)) {
  for(j in 1:n_skills) {
    for(k in 1:j) {
      # Skip j == k so the weight between a skill and itself
      # (e.g. Matlab-Matlab) stays 0
      if(df_mw[i, j] == 1 && df_mw[i, k] == 1 && sk_list[j] != sk_list[k]) {
        df_mwskillset[j, k] = df_mwskillset[j, k] + 1
      }
    }
  }
}
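As an aside, the triple loop can be replaced by a single matrix product. For a 0/1 indicator matrix M (rows = students, columns = skills), t(M) %*% M counts, for each pair of skills, how many students have both. Note the loop fills only the lower triangle, while crossprod gives the full symmetric matrix; a toy example:

```r
# Toy indicator matrix (illustrative data, not the survey)
M <- matrix(c(1, 1, 0,
              1, 0, 1,
              1, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("SQL", "Python", "R")))
co <- crossprod(M)  # equivalent to t(M) %*% M
diag(co) <- 0       # zero out same-skill pairs, as the loop does
co
```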
df_mwbyskill = data.frame(from = rep(rownames(df_mwskillset), times = ncol(df_mwskillset)),
                          to = rep(colnames(df_mwskillset), each = nrow(df_mwskillset)),
                          value = as.vector(unlist(df_mwskillset)),
                          stringsAsFactors = FALSE)
grid.col = NULL
grid.col[sk_list] = 1:length(sk_list)
chordDiagram(df_mwbyskill, grid.col = grid.col)
We use ggplot2 to draw kernel density plots, boxplots, and the joint distribution (as contours) of num_skills and prof_skills for the different programs.
library(ggplot2)
ggplot(mydata, aes(x = num_skills, color = Program)) + geom_density()
ggplot(mydata, aes(x = Program, y = num_skills, color = Program)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, fill = "black") +
  stat_summary(fun = median, geom = "point")
These two plots show the distributions of the number of skills for students from the different programs. All four distributions are right-skewed, and the distributions for the Data Science master’s students (MS_DS) and students from other programs (Others) have long tails.
ggplot(mydata, aes(x = prof_skills, color = Program)) + geom_density()
ggplot(mydata, aes(x = Program, y = prof_skills, color = Program)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, fill = "black") +
  stat_summary(fun = median, geom = "point")
These two plots show the distributions of skill proficiency for students from the different programs. Interestingly, Data Science Certificate students tend to report lower proficiency than the other three groups in terms of the median of the distribution. The distribution for students from other programs has a short tail, and the distribution for the Data Science master’s students (MS_DS) is close to normal.
ggplot(mydata, aes(x = num_skills, y = prof_skills)) +
  stat_density2d(aes(colour = after_stat(level))) +
  geom_point() +
  facet_wrap(~Program, scales = "free")
This plot shows the joint distribution of skill proficiency and number of skills for each student group; the contour lines represent the density of the distribution. From this joint perspective we can see that the distributions for Data Science Certificate students and Data Science master’s students (MS_DS) have a larger density around the “peak” than the other two groups.
I created a visualization that shows experience with tools across majors. Two plots are made: one compares Statistics with Data Science, and the other compares the Data Science master’s with the Data Science certificate. I chose these majors because the majority of people are in them; we can compare as many majors as needed. For each tool, the graph shows the proportion of people who know how to use it. I chose the top five tools identified by the random forest classifier. We can clearly see that the patterns for the two majors differ, and this kind of plot also shows what people from different programs are good at.
Now we separate the majors and do some calculations over the skills.
Radar plot reference: http://www.statisticstoproveanything.com/2013/11/spider-web-plots-in-r.html
We will now look at a decision tree to see whether we can predict a student’s program using only the student’s reported experience with the software and tools listed in the survey.
A decision tree was chosen because its interpretability is high, and it can give us insight into which categories create the purest subgroups, as measured by the Gini index.
The training set is 80% of the data, and we attempt to predict on the remaining 20% to get an idea of how this prediction algorithm might perform. We also vary the randomness of the training-set selection by changing set.seed(), which lets us see how high the variance of the tree might be: if the tree changes greatly across different training sets, we are experiencing high variance.
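The original training code is not shown here; a minimal sketch of the 80/20 split and tree fit described above, using the rpart package (which ships with R) on synthetic stand-in data (the real script would use df_clean):

```r
library(rpart)
# Synthetic stand-in: 0/1 skill indicators and a program label
set.seed(1)  # changing the seed changes the training set, exposing variance
toy <- data.frame(dropbox = rbinom(200, 1, 0.5),
                  SQL     = rbinom(200, 1, 0.5),
                  Program = factor(sample(c("MS_DS", "Others"), 200, replace = TRUE)))
# 80% of rows for training, the remaining 20% held out
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
fit  <- rpart(Program ~ ., data = toy[train_idx, ], method = "class")
pred <- predict(fit, toy[-train_idx, ], type = "class")
mean(pred == toy$Program[-train_idx])  # held-out accuracy
```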
We can see that the tree selected “dropbox” for the first split in all four cases. While the trees are not necessarily performing well, the trees changed after the first split in every case, which signifies a model with high variance. One way to bring the variance down is random forests, which randomly select candidate variables for each split across many decision trees and use a voting process to determine the classification. This voting decreases the variance we are currently seeing and should improve overall performance, as long as the decrease in variance outweighs the accompanying increase in bias.
Because the dataset is small, random forests train very quickly, so we can run multiple training attempts. We will look at a range of numbers of trees in the random forest and see how it performs. Random forests trade higher accuracy for harder interpretation compared with typical decision trees.
In the following we look at accuracy and importance, where importance is calculated as the mean decrease in the Gini index over all of the trees. In simpler terms, these graphs show the most important variables, i.e., those that produce the purest divisions within the data.
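The importance measure described here can be computed with the randomForest package (assumed here, since the original code is not shown), again sketched on synthetic stand-in data:

```r
library(randomForest)
set.seed(1)
# Synthetic stand-in data, as before (the real script would use df_clean)
toy <- data.frame(dropbox = rbinom(200, 1, 0.5),
                  SQL     = rbinom(200, 1, 0.5),
                  Program = factor(sample(c("MS_DS", "Others"), 200, replace = TRUE)))
rf <- randomForest(Program ~ ., data = toy, ntree = 200)
importance(rf)   # MeanDecreaseGini, per variable, averaged over all trees
# varImpPlot(rf) would draw the importance ranking
```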
Overall we can see an increase in prediction performance with different numbers of trees. While we saw a decrease when set.seed() = 1, in all the other cases we saw an increase, as expected. Most likely, if we received more training data we could expect the predictions to improve. On average over this small dataset we were 52.17% accurate.
To understand how we did, we can look at the percentages of each major.
Overall the largest major in the class is IDSE (master) at 50%, so if we just consistently guessed IDSE we would still do fairly well; the random forest did only slightly better, at 52.17% on average.
Looking into decision trees helped us better understand which factors might differentiate the programs. We consistently saw that dropbox played the largest role in how our algorithm assigned students to programs. However, even with 200 trees in the random forest we still saw the importance ranking change, suggesting our data is very spread out and has high variance. This is where more data could help performance.